Add support for different models in num_tokens_from_text function #90
vidhula17 wants to merge 4 commits into microsoft:main
Conversation
@microsoft-github-policy-service agree
thinkall left a comment:
Thank you very much @vidhula17 for the PR, nice job! I've left some comments, could you please address them? Let me know if you need any help.
Thanks again for your contribution!
| """Return the number of tokens used by a text for different models.""" | ||
|
|
||
| # Define token counts for known models | ||
| known_models = { |
Why is gpt-3.5-turbo-0301 not in the known models?
| "gpt-4-0613": (3, 1), | ||
| "gpt-4-32k-0613": (3, 1), | ||
| } | ||
|
|
We can add a parameter to the function, say model_token: dict = None, and add the code below to support customizing a model's tokens_per_message without modifying the code here:

if isinstance(model_token, dict):
    known_models.update(model_token)

The parameter can be passed in retrieve_config in autogen/autogen/agentchat/contrib/retrieve_user_proxy_agent.py.
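For illustration, a minimal sketch of how such an override could be wired in; the helper name resolve_token_counts, the constant KNOWN_MODELS, and the (3, 1) default used below are assumptions for the sketch, not the PR's API:

```python
from typing import Dict, Optional, Tuple

# Hypothetical sketch of the suggested model_token override (names and defaults are assumptions).
KNOWN_MODELS: Dict[str, Tuple[int, int]] = {
    "gpt-4-0613": (3, 1),      # (tokens_per_message, tokens_per_name)
    "gpt-4-32k-0613": (3, 1),
}


def resolve_token_counts(model: str, model_token: Optional[dict] = None) -> Tuple[int, int]:
    """Return (tokens_per_message, tokens_per_name), allowing caller-supplied overrides."""
    models = dict(KNOWN_MODELS)
    if isinstance(model_token, dict):
        models.update(model_token)  # register extra models without editing this module
    return models.get(model, (3, 1))


# The override dict could then be threaded through, e.g. from retrieve_config.
print(resolve_token_counts("my-custom-model", {"my-custom-model": (4, 2)}))  # (4, 2)
```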
    if model == "your-new-model-name":
        tokens_per_message = 3
        tokens_per_name = 1
    else:
        raise NotImplementedError(
            f"num_tokens_from_text() is not implemented for model {model}. See "
            f"https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are "
            f"converted to tokens."
        )
Suggested change: replace the if/else block above with plain defaults.

    tokens_per_message = 3
    tokens_per_name = 1
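In effect, the suggestion makes any model not in known_models fall back to common defaults instead of raising; a tiny illustration of that lookup-with-default pattern (values are the same assumptions as above):

```python
known_models = {"gpt-4-0613": (3, 1), "gpt-4-32k-0613": (3, 1)}

# Unknown models fall back to the (3, 1) defaults rather than raising NotImplementedError.
tokens_per_message, tokens_per_name = known_models.get("some-new-model", (3, 1))
print(tokens_per_message, tokens_per_name)  # 3 1
```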
        )

    # Use tiktoken to calculate the number of tokens in the text
    encoding = tiktoken.encoding_for_model(model)
Suggested change: wrap the encoding lookup in a try/except with a fallback.

    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        logger.warning("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
A try...except block is needed here.
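For context, a self-contained sketch of how that fallback and the actual counting could fit together; the function name count_text_tokens and the plain len(encode(...)) accounting are assumptions for the sketch, not the PR's exact implementation:

```python
import logging

import tiktoken

logger = logging.getLogger(__name__)


def count_text_tokens(text: str, model: str = "gpt-3.5-turbo-0613") -> int:
    """Hypothetical minimal counter: tokenize the text with the model's encoding."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unrecognized model names raise KeyError; fall back to a generic encoding.
        logger.warning("Model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))


print(count_text_tokens("hello world"))  # 2 with cl100k_base
```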
    with self.assertRaises(NotImplementedError):
        num_tokens_from_text(text, model)
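A self-contained sketch of how this kind of check could sit in a unittest test case; the stand-in num_tokens_from_text below is a hypothetical placeholder so the sketch runs on its own, whereas a real test would import the function from the package:

```python
import unittest


# Hypothetical placeholder; a real test imports num_tokens_from_text from the package.
def num_tokens_from_text(text: str, model: str) -> int:
    known_models = {"gpt-4-0613": (3, 1), "gpt-4-32k-0613": (3, 1)}
    if model not in known_models:
        raise NotImplementedError(f"num_tokens_from_text() is not implemented for model {model}.")
    # Trivial stand-in count: one token per whitespace-separated word.
    return len(text.split())


class TestNumTokensFromText(unittest.TestCase):
    def test_known_model_returns_count(self):
        self.assertGreater(num_tokens_from_text("hello world", "gpt-4-0613"), 0)

    def test_unknown_model_raises(self):
        with self.assertRaises(NotImplementedError):
            num_tokens_from_text("hello world", "some-unknown-model")


if __name__ == "__main__":
    unittest.main()
```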
Also, the code format check failed, please run the formatter locally.
Codecov Report
@@            Coverage Diff             @@
##             main      #90      +/-   ##
==========================================
+ Coverage   41.03%   41.33%   +0.30%
==========================================
  Files          17       17
  Lines        2091     2083       -8
  Branches      469      467       -2
==========================================
+ Hits          858      861       +3
+ Misses       1156     1145      -11
  Partials       77       77
Flags with carried forward coverage won't be shown.
Why are these changes needed?
This PR extends the num_tokens_from_text function to support a wider range of language models beyond the "gpt-x" series. It enhances code flexibility and welcomes community contributions for various models, improving project versatility.
Related issue number
Closes #63
Checks